Abstract

This paper presents an algorithm for tagging words whose
part-of-speech properties are unknown. Unlike previous work, the
algorithm categorizes word tokens in context instead of word types.
The algorithm is evaluated on the Brown Corpus.



1 Introduction 

As online text becomes available in ever increasing volumes and an
ever increasing number of languages, there is a growing need for
robust processing techniques that can analyze text without expensive
and time-consuming adaptation to new domains and genres. This need
motivates research on fully automatic text processing that may rely
on general principles of linguistics and computation, but does not
depend on knowledge about individual words.

In this paper, we describe an experiment on fully automatic
derivation of the knowledge necessary for part-of-speech tagging.
Part-of-speech tagging is of interest for a number of applications,
for example access to text data bases (Kupiec, 1993), robust parsing
(Abney, 1991), and general parsing (deMarcken, 1990; Charniak et al.,
1994). The goal is to find an unsupervised method for tagging that
relies on general distributional properties of text, properties that
are invariant across languages and sublanguages. While the proposed
algorithm is not successful for all grammatical categories, it does
show that fully automatic tagging is possible when demands on
accuracy are modest.

The following sections discuss related work, describe the learning
procedure and evaluate it on the Brown Corpus (Francis and Kucera,
1982).



2 Related Work

The simplest part-of-speech taggers are bigram or trigram models
(Church, 1989; Charniak et al., 1993). They require a relatively
large tagged training text. Transformation-based tagging as
introduced by Brill (1993) also requires a hand-tagged text for
training. No pretagged text is necessary for Hidden Markov Models
(Jelinek, 1985; Cutting et al., 1991; Kupiec, 1992). Still, a lexicon
is needed that specifies the possible parts of speech for every word.
Brill and Marcus (1992a) have shown that the effort necessary to
construct the part-of-speech lexicon can be considerably reduced by
combining learning procedures and a partial part-of-speech
categorization elicited from an informant.

The present paper is concerned with tagging languages and
sublanguages for which no a priori knowledge about grammatical
categories is available, a situation that occurs often in practice
(Brill and Marcus, 1992a).



Several researchers have worked on learning grammatical properties
of words. Elman (1990) trains a connectionist net to predict words, a
process that generates internal representations that reflect
grammatical category. Brill et al. (1990) try to infer grammatical
category from bigram statistics. Finch and Chater (1992) and Finch
(1993) use vector models in which words are clustered according to
the similarity of their close neighbors in a corpus. Kneser and Ney
(1993) present a probabilistic model for entropy maximization that
also relies on the immediate neighbors of words in a corpus. Biber
(1993) applies factor analysis to collocations of two target words
("certain" and "right") with their immediate neighbors.

What these approaches have in common is that they classify words
instead of individual occurrences. Given the widespread
part-of-speech ambiguity of words this is problematic.¹ How should a
word like "plant" be categorized if it has uses both as a verb


1 Although Biber (1993) classifies collocations, these
can also be ambiguous. For example, "for certain" has
both senses of "certain": "particular" and "sure".



word    side   nearest neighbors
onto    left   into toward away off together against beside around down
onto    right  reduce among regarding against towards plus toward using unlike
seemed  left   appeared might would remained had became could must should
seemed  right  seem seems wanted want going meant tried expect likely

Table 1: Words with most similar left and right neighbors for "onto" and "seemed".



and as a noun? How can a categorization be consid- 
ered meaningful if the infinitive marker "to" is not 
distinguished from the homophonous preposition? 

In a previous paper (Schütze, 1993), we trained
a neural network to disambiguate part-of-speech us- 
ing context; however, no information about the word 
that is to be categorized was used. This scheme fails 
for cases like "The soldiers rarely come home." vs. 
"The soldiers will come home." where the context 
is identical and information about the lexical item 
in question ("rarely" vs. "will") is needed in combi- 
nation with context for correct classification. In this 
paper, we will compare two tagging algorithms, one 
based on classifying word types, and one based on 
classifying words-plus-context.

3 Tag induction 

We start by constructing representations of the syn- 
tactic behavior of a word with respect to its left and 
right context. Our working hypothesis is that syn- 
tactic behavior is reflected in co-occurrence patterns. 
Therefore, we will measure the similarity between 
two words with respect to their syntactic behavior 
to, say, their left side by the degree to which they 
share the same neighbors on the left. If the counts 
of neighbors are assembled into a vector (with one 
dimension for each neighbor), the cosine can be employed
to measure similarity. It will assign a value
close to 1.0 if two words share many neighbors, and 
0.0 if they share none. We refer to the vector of left 
neighbors of a word as its left context vector, and 
to the vector of right neighbors as its right context 
vector. The unreduced context vectors in the experi- 
ment described here have 250 entries, corresponding 
to the 250 most frequent words in the Brown corpus. 
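
The construction of left and right context vectors and their comparison by cosine can be sketched as follows. This is a minimal illustration on a toy token list, not the original implementation; the function names `context_vectors` and `cosine` and the four-word vocabulary are assumptions made for the example (the experiment uses the 250 most frequent words of the Brown corpus).

```python
from math import sqrt


def context_vectors(tokens, vocab):
    """Count, for every word, which vocab words occur directly to its left and right.

    Returns two dicts (left, right) mapping a word to a count vector with one
    dimension per vocab word. `vocab` stands in for the 250 most frequent words.
    """
    index = {w: i for i, w in enumerate(vocab)}
    left, right = {}, {}
    for i, w in enumerate(tokens):
        lv = left.setdefault(w, [0] * len(vocab))
        rv = right.setdefault(w, [0] * len(vocab))
        if i > 0 and tokens[i - 1] in index:
            lv[index[tokens[i - 1]]] += 1
        if i + 1 < len(tokens) and tokens[i + 1] in index:
            rv[index[tokens[i + 1]]] += 1
    return left, right


def cosine(u, v):
    """Cosine of the angle between two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


tokens = "the cat sat on the mat and the dog sat on the rug".split()
vocab = ["the", "sat", "on", "and"]  # toy stand-in for the frequent-word list
left, right = context_vectors(tokens, vocab)

# "cat" and "dog" are both preceded by "the", so their left vectors coincide.
print(cosine(left["cat"], left["dog"]))
# "mat" is followed by "and", "rug" by nothing: no shared right neighbors.
print(cosine(right["mat"], right["rug"]))
```

On real data the vectors are sparse, which is why the next paragraphs turn to dimensionality reduction.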

This basic idea of measuring distributional simi- 
larity in terms of shared neighbors must be modified 
because of the sparseness of the data. Consider two 
infrequent adjectives that happen to modify different 
nouns in the corpus. Their right similarity according 
to the cosine measure would be zero. This is clearly 
undesirable. But even with high-frequency words, 
the simple vector model can yield misleading similarity
measurements. A case in point is "a" vs. "an".
These two articles do not share any right neighbors 
since the former is only used before consonants and 



the latter only before vowels. Yet intuitively, they 
are similar with respect to their right syntactic con- 
text despite the lack of common right neighbors. 

Our solution to these problems is the application 
of a singular value decomposition. We can represent 
the left vectors of all words in the corpus as a matrix 
C with n rows, one for each word whose left neigh- 
bors are to be represented, and k columns, one for 
each of the possible neighbors. SVD can be used to 
approximate the row and column vectors of C in a 
low-dimensional space. In more detail, SVD decomposes a matrix C,
the matrix of left vectors in our case, into three matrices $T_0$,
$S_0$, and $D_0$ such that:

$C = T_0 S_0 D_0'$

$S_0$ is a diagonal k-by-k matrix that contains the singular values
of C in descending order. The ith singular value can be interpreted
as indicating the strength of the ith principal component of C.
$T_0$ and $D_0$ are orthonormal matrices that approximate the rows
and columns of C, respectively. By restricting the matrices $T_0$,
$S_0$, and $D_0$ to their first m < k columns (= principal
components) one obtains the matrices T, S, and D. Their product
$\hat{C}$ is the best least square approximation of C by a matrix of
rank m: $\hat{C} = T S D'$. We chose m = 50 (reduction to a
50-dimensional space) for the SVDs described in this paper.
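
The least-squares property of the rank-m approximation can be stated as a short worked equation. The notation here is an assumption made for illustration: $\hat{C}$ denotes the product of the truncated matrices T, S, and D, and $s_1 \ge s_2 \ge \dots \ge s_k$ denote the diagonal entries of the singular value matrix, which the text does not name individually.

```latex
\hat{C} = T S D', \qquad
\lVert C - \hat{C} \rVert_F^2 \;=\; \sum_{i=m+1}^{k} s_i^2
```

In words, the squared approximation error equals the sum of the squared singular values that were dropped, so discarding components with small singular values changes the matrix only slightly.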

SVD addresses the problems of generalization and 
sparseness because broad and stable generalizations 
are represented on dimensions with large values 
which will be retained in the dimensionality re- 
duction. In contrast, dimensions corresponding to 
small singular values represent idiosyncrasies, like 
the phonological constraint on the usage of "an" vs. 
"a", and will be dropped. We also gain efficiency 
since we can manipulate smaller vectors, reduced to 
50 dimensions. We used SVDPACK to compute the
singular value decompositions described in this paper
(Berry, 1992).

Table 1 shows the nearest neighbors of two words
(ordered according to closeness to the head word) 
after the dimensionality reduction. Neighbors with 
highest similarity according to both left and right 
context are listed. One can see clear differences 
between the nearest neighbors in the two spaces. 



The right-context neighbors of "onto" contain verbs 
because both prepositions and verbs govern noun 
phrases to their right. The left-context neighbor- 
hood of "onto" reflects the fact that prepositional 
phrases are used in the same position as adverbs like 
"away" and "together", thus making their left con- 
text similar. For "seemed", left-context neighbors 
are words that have similar types of noun phrases 
in subject position (mainly auxiliaries). The right-context
neighbors all take "to"-infinitives as complements. An adjective
like "likely" is very similar to "seemed" in this respect although
its left context is quite different from that of "seemed".
Similarly, the
generalization that prepositions and transitive verbs 
are very similar if not identical in the way they gov- 
ern noun phrases would be lost if "left" and "right" 
properties of words were lumped together in one rep- 
resentation. These examples demonstrate the im- 
portance of representing generalizations about left 
and right context separately. 

The left and right context vectors are the basis for 
four different tag induction experiments, which are 
described in detail below: 

• induction based on word type only 

• induction based on word type and context 

• induction based on word type and context, re- 
stricted to "natural" contexts 

• induction based on word type and context, us- 
ing generalized left and right context vectors 

3.1 Induction based on word type only 

The two context vectors of a word characterize the distribution of
neighboring words to its left and right. The concatenation of left
and right context vector can therefore serve as a representation of
a word's distributional behavior (Finch and Chater, 1992; Schütze,
1993). We formed such concatenated vectors for all 47,025 words
(surface forms) in the Brown corpus. Here, we use the raw
250-dimensional context vectors and apply the SVD to the
47,025-by-500 matrix (47,025 words with two 250-dimensional context
vectors each). We obtained 47,025 50-dimensional reduced vectors
from the SVD and clustered them into 200 classes using the fast
clustering algorithm Buckshot (Cutting et al., 1992) (group average
agglomeration applied to a sample). This classification constitutes
the baseline performance for distributional part-of-speech tagging.
All occurrences of a word are assigned to one class. As pointed out
above, such a procedure is problematic for ambiguous words.
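
The clustering step can be illustrated with a simplified sketch in the spirit of Buckshot: group-average agglomeration on a sample, followed by assignment of every vector to the nearest resulting centroid. This is not the algorithm as published by Cutting et al.; the function names, the sample-size rule, the use of between-cluster average cosine as the merge criterion, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import random
from math import sqrt


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def group_average(vectors, k):
    """Repeatedly merge the two clusters with the highest average pairwise
    cosine between their members until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


def buckshot(vectors, k, sample_size=None, seed=0):
    """Cluster a random sample by group averaging, then label every vector
    with the index of its most similar sample-cluster centroid."""
    n = len(vectors)
    if sample_size is None:
        sample_size = min(n, max(k, int(sqrt(k * n))))  # assumed rule of thumb
    sample = random.Random(seed).sample(range(n), sample_size)
    seeds = [vectors[i] for i in sample]
    groups = group_average(seeds, k)
    centroids = [centroid([seeds[i] for i in g]) for g in groups]
    return [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors]


# Toy "reduced vectors": three word types near one axis, three near the other.
reduced = [[1.0, 0.1], [0.9, 0.0], [1.0, 0.2],
           [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
labels = buckshot(reduced, k=2, sample_size=len(reduced))
print(labels)
```

In the word-type setting each label applies to a word type, so every occurrence of that word inherits the same class.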



3.2 Induction based on word type and context

In order to exploit contextual information in the 
classification of a token, we simply use context vec- 
tors of the two words occurring next to the token. 
An occurrence of word w is represented by a con- 
catenation of four context vectors: 

• The right context vector of the preceding word. 

• The left context vector of w. 

• The right context vector of w. 

• The left context vector of the following word. 

The motivation is that a word's syntactic role de- 
pends both on the syntactic properties of its neigh- 
bors and on its own potential for entering into syn- 
tactic relationships with these neighbors. The only 
properties of context that we consider are the right- 
context vector of the preceding word and the left- 
context vector of the following word because they 
seem to represent the contextual information most 
important for the categorization of w. For example, for the
disambiguation of "work" in "her work seemed to be important", only
the fact that "seemed" expects noun phrases to its left is
important; the right context vector of "seemed" does not contribute
to disambiguation. That only the immediate neighbors are crucial
for categorization is
clearly a simplification, but as the results presented 
below show it seems to work surprisingly well. 
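
The four-part representation of an occurrence can be sketched as a simple concatenation. The helper name `token_vector`, the dictionary layout, and the zero-vector fallback at the edges of the token list are assumptions made for this illustration; the text does not say how corpus boundaries were handled.

```python
def token_vector(tokens, i, left, right, dim):
    """Represent the occurrence tokens[i] by concatenating, in this order:
    right vector of the preceding word, left vector of the word itself,
    right vector of the word itself, left vector of the following word.

    `left` and `right` map words to context vectors of length `dim`.
    Missing neighbors or unknown words fall back to a zero vector (assumption).
    """
    zero = [0] * dim
    prev_right = right.get(tokens[i - 1], zero) if i > 0 else zero
    next_left = left.get(tokens[i + 1], zero) if i + 1 < len(tokens) else zero
    w = tokens[i]
    return prev_right + left.get(w, zero) + right.get(w, zero) + next_left


# Toy two-dimensional context vectors (hypothetical values).
left = {"her": [1, 0], "work": [0, 1], "seemed": [1, 1]}
right = {"her": [2, 0], "work": [0, 2], "seemed": [2, 2]}
tokens = ["her", "work", "seemed"]

vec = token_vector(tokens, 1, left, right, dim=2)
print(vec)  # [2, 0, 0, 1, 0, 2, 1, 1]
```

With 250-dimensional context vectors the same concatenation yields the 1,000-component vectors used in the experiments.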

Again, an SVD is applied to address the problems of sparseness and
generalization. We randomly selected 20,000 word triplets from the
corpus and formed concatenations of four context vectors as
described above. The singular value decomposition of the resulting
20,000-by-1,000 matrix defines a mapping from the 1,000-dimensional
space of concatenated context vectors to a 50-dimensional reduced
space. Our tag set was then induced by clustering the reduced
vectors of the 20,000 selected occurrences into 200 classes. Each of
the 200 tags is defined by the centroid of the corresponding class
(the sum of its members). Distributional tagging of an occurrence of
a word w then proceeds by retrieving the four relevant context
vectors (right context vector of the previous word, left context
vector of the following word, both context vectors of w),
concatenating them into one 1,000-component vector, mapping this
vector to 50 dimensions, computing the correlations with the 200
cluster centroids and, finally, assigning the occurrence to the
closest cluster. This procedure was applied to all tokens of the
Brown corpus.
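
The final tagging step (map to the reduced space, compare with the centroids, pick the closest) might look like the sketch below. Two points are assumptions rather than details from the text: the projection is written as a plain product with a k-by-m basis matrix, and cosine is used as the similarity where the text speaks of correlations. The names `project` and `assign_tag` and the tiny matrices are hypothetical.

```python
from math import sqrt


def project(x, basis):
    """Map a k-dimensional vector to m dimensions: y = x * basis (basis is k-by-m)."""
    m = len(basis[0])
    return [sum(x[r] * basis[r][c] for r in range(len(x))) for c in range(m)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_tag(x, basis, centroids):
    """Return the index of the centroid most similar to the projected vector."""
    y = project(x, basis)
    return max(range(len(centroids)), key=lambda t: cosine(y, centroids[t]))


# Toy setting: 3-dimensional input, 2-dimensional reduced space, 2 induced tags.
basis = [[1.0, 0.0],
         [0.0, 1.0],
         [0.0, 0.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]

tag = assign_tag([3.0, 4.0, 5.0], basis, centroids)
print(tag)  # 1, since the projection [3.0, 4.0] is closer in angle to [0.0, 1.0]
```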



tag  description         Penn Treebank tags   tag  description          Penn Treebank tags
ADN  adnominal modifier  ADN* $               POS  possessive marker    POS
CC   conjunction         CC                   PRP  pronoun              PRP
CD   cardinal            CD                   RB   adverbial            RB RP RBR RBS
DT   determiner          DT PDT PRP$          TO   infinitive marker    TO
IN   preposition         IN                   VB   infinitive           VB
ING  "-ing" forms        VBG                  VBD  inflected verb form  VBD VBZ VBP
MD   modal               MD                   VBN  predicative          VBN PRD*
N    nominal             NNP(S) NN(S)         WDT  wh-word              WP($) WRB WDT

Table 2: Evaluation tag set. Structural tags derived from parse trees are marked with *.



We will see below that this method of distribu- 
tional tagging, although partially successful, fails 
for many tokens whose neighbors are punctuation 
marks. The context vectors of punctuation marks 
contribute little information about syntactic catego- 
rization since there are no grammatical dependencies 
between words and punctuation marks, in contrast 
to strong dependencies between neighboring words. 

For this reason, a second induction on the basis of word type and
context was performed, but only for those tokens with informative
("natural") contexts. Tokens next to punctuation marks and tokens
with rare words (fewer than ten occurrences) as neighbors were not
included. Rare words were excluded for a similar reason: if a word
occurs only nine or fewer times, its left and right context vectors
capture little information for syntactic categorization. In the
experiment, 20,000 natural contexts were randomly selected,
processed by the SVD and clustered into 200 classes. The
classification was then applied to all natural contexts of the
Brown corpus.
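
A filter for "natural" contexts can be sketched as below. The function name `natural_positions`, the punctuation test based on Python's `string.punctuation`, and the decision to skip the first and last token (which lack one neighbor) are assumptions made for the example; only the frequency threshold of ten comes from the text.

```python
import string
from collections import Counter


def natural_positions(tokens, min_count=10):
    """Return indices of tokens whose two neighbors are both informative,
    i.e. neither a punctuation mark nor a word seen fewer than min_count times."""
    counts = Counter(tokens)

    def informative(word):
        is_punct = all(ch in string.punctuation for ch in word)
        return not is_punct and counts[word] >= min_count

    return [i for i in range(1, len(tokens) - 1)
            if informative(tokens[i - 1]) and informative(tokens[i + 1])]


# Toy example with a lowered threshold so that only "the" counts as frequent.
tokens = ["the", "cat", ",", "the", "dog", "the", "end"]
print(natural_positions(tokens, min_count=2))  # [4]
```

Only "dog" at index 4 qualifies here: both of its neighbors are the frequent word "the".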

3.3 Generalized context vectors 

The context vectors used so far only capture infor- 
mation about distributional interactions with the 
250 most frequent words. Intuitively, it should be 
possible to gain accuracy in tag induction by us- 
ing information from more words. One way to do 
this is to let the right context vector record which 
classes of left context vectors occur to the right of a 
word. The rationale is that words with similar left 
context characterize words to their right in a simi- 
lar way. For example, "seemed" and "would" have 
similar left contexts, and they characterize the right 
contexts of "he" and "the firefighter" as potentially 
containing an inflected verb form. Rather than having separate
entries in its right context vector for "seemed", "would", and
"likes", a word like "he" can now be characterized by a generalized
entry for "inflected verb form occurs frequently to my right".

This proposal was implemented by applying a singular value
decomposition to the 47,025-by-250 matrix of left context vectors
and clustering the resulting context vectors into 250 classes. A
generalized right context vector v for word w was then formed by
counting how often words from these 250 classes occurred to the
right of w. Entry $v_i$ counts the number of times that a word from
class i occurs to the right of w in the corpus (as opposed to the
number of times that the word with frequency rank i occurs to the
right of w). Generalized left context vectors were derived by an
analogous procedure using word-based right context vectors. Note
that the information about left and right is kept separate in this
computation. This differs from previous approaches (Finch and
Chater, 1992; Schütze, 1993) in which left and right context vectors
of a word are always used in one concatenated vector. There are
arguably
fewer different types of right syntactic contexts than 
types of syntactic categories. For example, transitive 
verbs and prepositions belong to different syntac- 
tic categories, but their right contexts are virtually 
identical in that they require a noun phrase. This 
generalization could not be exploited if left and right 
context were not treated separately. 
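
The counting step for generalized right context vectors can be sketched as follows. The class assignment `word_class` is taken as given (in the text it results from clustering the reduced left context vectors); the function name, the toy sentence, and the two hand-made classes are assumptions made for the example.

```python
def generalized_right_vectors(tokens, word_class, n_classes):
    """For each word, count how often a word of each class occurs directly
    to its right. Entry i of a vector is the count for class i.

    `word_class` maps words to class indices obtained from clustering left
    context vectors; words without a class are ignored.
    """
    vectors = {w: [0] * n_classes for w in tokens}
    for w, nxt in zip(tokens, tokens[1:]):
        c = word_class.get(nxt)
        if c is not None:
            vectors[w][c] += 1
    return vectors


tokens = "he seemed happy he would go she likes tea".split()
# Hypothetical classes: 0 = words with verb-like left contexts, 1 = everything else seen.
word_class = {"seemed": 0, "would": 0, "likes": 0, "happy": 1, "go": 1, "tea": 1}

vectors = generalized_right_vectors(tokens, word_class, n_classes=2)
print(vectors["he"])   # [2, 0]
print(vectors["she"])  # [1, 0]
```

"he" and "she" end up with proportional vectors even though they never share a right neighbor, which is the generalization the class-based scheme is meant to capture.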

Another argument for the two-step derivation is 
that many words don't have any of the 250 most 
frequent words as their left or right neighbor. Hence, 
their vector would be zero in the word-based scheme. 
The class-based scheme makes it more likely that 
meaningful representations are formed for all words 
in the vocabulary. 

The generalized context vectors were input to the tag induction
procedure described above for word-based context vectors: 20,000
word triplets were selected from the corpus, encoded as
1,000-dimensional vectors (consisting of four generalized context
vectors), decomposed by a singular value decomposition and clustered
into 200 classes. The resulting classification was applied to all
tokens in the Brown corpus.



4 Results 

The results of the four experiments were evaluated by forming 16
classes of tags from the Penn Treebank as shown in Table 2.
Preliminary experiments showed that distributional methods
distinguish adnominal and predicative uses of adjectives (e.g. "the
black cat" vs. "the cat is black"). Therefore the tag "ADN" was
introduced for uses of adjectives, nouns, and participles as
adnominal modifiers. The tag "PRD" stands for predicative uses of
adjectives. The Penn Treebank parses of the Brown corpus were used
to determine whether a token functions as an adnominal modifier.
Punctuation marks, special symbols, interjections, foreign words
and tags with fewer than 100 instances were excluded from the
evaluation.

Tables 3 and 4 present results for word type-based induction and
induction based on word type and context. For each tag t, the table
lists the frequency of t in the corpus ("frequency")²; the number of
induced tags $i_0, i_1, \ldots, i_l$ that were assigned to it
("# classes"); the number of times an occurrence of t was correctly
labeled as belonging to one of $i_0, i_1, \ldots, i_l$ ("correct");
the number of times that a token of a different tag t' was
miscategorized as being an instance of $i_0, i_1, \ldots, i_l$
("incorrect"); and precision and recall of the categorization of t.
Precision is the number of correct tokens divided by the sum of
correct and incorrect tokens. Recall is the number of correct tokens
divided by the total number of tokens of t (the "frequency" column).
The last column gives van Rijsbergen's F measure, which computes an
aggregate score from precision and recall (van Rijsbergen, 1979):

$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$

We chose $\alpha = 0.5$ to give equal weight to precision and recall.
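
The three measures can be recomputed from the counts in any table row. The sketch below assumes the standard form of van Rijsbergen's F measure with weight alpha and uses the CC row that lists 2 classes (frequency 36808, 28671 correct, 1501 incorrect) as a check; the function name is hypothetical, and the code assumes nonzero precision and recall.

```python
def precision_recall_f(correct, incorrect, frequency, alpha=0.5):
    """Precision, recall and F for one tag.

    precision = correct / (correct + incorrect)
    recall    = correct / frequency
    F         = 1 / (alpha / precision + (1 - alpha) / recall)
    """
    precision = correct / (correct + incorrect)
    recall = correct / frequency
    f = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f


p, r, f = precision_recall_f(correct=28671, incorrect=1501, frequency=36808)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.95 0.78 0.86
```

With alpha = 0.5 the measure reduces to the harmonic mean of precision and recall, 2PR / (P + R).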

It is clear from the tables that incorporating con- 
text improves performance considerably. The F 
score increases for all tags except CD, with an av- 
erage improvement of more than 0.20. The tag CD 
is probably better thought of as describing a word 
class. There is a wide range of heterogeneous syn- 
tactic functions of cardinals in particular contexts: 
quantificational and adnominal uses, bare NP's ("is 
one of"), dates and ages ("Jan 1", "gave his age as 
25"), and enumerations. In this light, it is not sur- 
prising that the word-type method does better on 
cardinals. 

Table 5 shows that performance for generalized
context vectors is better than for word-based context 
vectors (0.74 vs. 0.72). However, since the number 
of tags with better and worse performance is about 
the same (7 and 5), one cannot conclude with cer- 
tainty that generalized context vectors induce tags 



2 The small difference in overall frequency in the tables 
is due to the fact that some word-based context vectors 
consist entirely of zeros. There were about a hundred 
word triplets whose four context vectors did not have 
non-zero entries and could not be assigned a cluster. 



of higher quality. Apparently, the 250 most frequent 
words capture most of the relevant distributional in- 
formation so that the additional information from 
less frequent words available from generalized vec- 
tors only has a small effect. 

Table 6 looks at results for "natural" contexts,
i.e. those not containing punctuation marks and rare 
words. Performance is consistently better than for 
the evaluation on all contexts, indicating that the 
low quality of the distributional information about 
punctuation marks and rare words is a difficulty for 
successful tag induction. 

Even for "natural" contexts, performance varies 
considerably. It is fairly good for prepositions, 
determiners, pronouns, conjunctions, the infinitive 
marker, modals, and the possessive marker. Tag 
induction fails for cardinals (for the reasons men- 
tioned above) and for "-ing" forms. Present partici- 
ples and gerunds are difficult because they exhibit 
both verbal and nominal properties and occur in a 
wide variety of different contexts whereas other parts 
of speech have a few typical and frequent contexts. 

It may seem worrying that some of the tags are 
assigned a high number of clusters (e.g., 49 for N, 36 
for ADN). A closer look reveals that many clusters 
embody finer distinctions. Some examples: Nouns in 
cluster are heads of larger noun phrases, whereas 
the nouns in cluster 1 are full-fledged NPs. The 
members of classes 29 and 111 function as subjects. 
Class 49 consists of proper nouns. However, there 
are many pairs or triples of clusters that should be 
collapsed into one on linguistic grounds. They were 
separated on distributional criteria that don't have 
linguistic correlates. 

An analysis of the divergence between our classifi- 
cation and the manually assigned tags revealed three 
main sources of errors: rare words and rare syntac- 
tic phenomena, indistinguishable distribution, and 
non-local dependencies. 

Rare words are difficult because of lack of distri- 
butional evidence. For example, "ties" is used as a 
verb only 2 times (out of 15 occurrences in the cor- 
pus). Both occurrences are miscategorized, since its 
context vectors do not provide enough evidence for 
the verbal use. Rare syntactic constructions pose 
a related problem: There are not enough instances 
to justify the creation of a separate cluster. For 
example, verbs taking bare infinitives were classi- 
fied as adverbs since this is too rare a phenomenon 
to provide strong distributional evidence ("we do not DARE speak
of", "legislation could HELP remove").

The case of the tags "VBN" and "PRD" (past 
participles and predicative adjectives) demonstrates 

Table 3: Precision and recall for induction based on word type.



tag   frequency  # classes  correct  incorrect  precision  recall  F
ADN   108532     42         87128    24743      0.78       0.80    0.79
CC    36808      2          28671    1501       0.95       0.78    0.86
CD    15084      1          747      809        0.48       0.05    0.09
DT    129626     6          119534   6178       0.95       0.92    0.94
IN    132079     11         125554   25316      0.83       0.95    0.89
ING   14753      4          3096     4876       0.39       0.21    0.27
MD    13498      2          12983    936        0.93       …       0.95
N     231424     68         207822   …          …          …       …
…
VB    …          …          …        …          …          …       …
…

[Cells and rows shown as … did not survive extraction. One further value, 0.65, survives after the VB label but cannot be assigned to a row or column.]

Table 4: Precision and recall for induction based on word type and context.
tag   frequency  # classes  correct  incorrect  precision  recall  F
ADN   108586     …          …        …          …          …       …
CC    36808      4          34127    6430       0.84       0.93    0.88
CD    15085      3          3707     1530       0.71       0.25    0.36
DT    129626     10         …        …          0.95       0.93    0.94
IN    …          8          …        22070      0.85       0.94    0.89
ING   …          2          3798     7161       0.35       0.26    …
MD    …          3          …        1059       0.93       0.98    0.95
N     231434     70         201890   33206      0.86       0.87    0.87
POS   5086       2          4932     1636       0.75       0.97    …
PRP   …          5          …        …          …          …       0.79
RB    54524      9          29892    …          …          …       …
TO    25181      …          …        …          …          …       …
VB    …          7          28879    6560       0.81       0.82    0.82
VBD   80058      15         66457    12079      0.85       0.83    0.84
VBN   41145      10         26960    17356      0.61       0.66    0.63
WDT   14093      1          2223     563        0.80       0.16    0.26
avg.                                            0.78       0.73    0.74

[Cells shown as … did not survive extraction. Two surviving values cannot be assigned to a column: 0.77 in the ADN row and 0.58 in the RB row.]

Table 5: Precision and recall for induction based on generalized context vectors.



tag   frequency  # classes  correct  incorrect  precision  recall  F
ADN   63771      36         54398    12203      0.82       0.85    0.83
CC    16148      4          15657    1798       0.90       0.97    0.93
CD    7011       1          1857     918        0.67       0.26    0.38
DT    87914      9          82206    2664       0.97       0.94    0.95
IN    91950      9          86793    6842       0.93       0.94    0.94
ING   7268       2          1243     1412       0.47       0.17    0.25
MD    11244      3          10363    476        0.96       0.92    0.94
N     111368     49         100105   14452      0.87       0.90    0.89
POS   3202       1          2912     255        0.92       0.91    0.91
PRP   23946      7          22877    4062       0.85       0.96    0.90
RB    32331      16         21037    9922       0.68       0.65    0.66
TO    19859      2          19537    53         1.00       0.98    0.99
VB    26714      11         24036    4119       0.85       0.90    0.88
VBD   56540      33         51016    8488       0.86       0.90    0.88
VBN   24804      14         18889    7448       0.72       0.76    0.74
WDT   8329       3          3691     670        0.85       0.44    0.58
avg.                                            0.83       0.78    0.79

Table 6: Precision and recall for induction for natural contexts.



the difficulties of word classes with indistinguish- 
able distributions. There are hardly any distribu- 
tional clues for distinguishing "VBN" and "PRD" 



since both are mainly used as complements of "to be".³ A common
tag class was created for "VBN" and "PRD" to show that they are
reasonably well distinguished from other parts of speech, even if
not from each other. Semantic understanding is necessary to
distinguish between the states described by phrases of the form "to
be adjective" and the processes described by phrases of the form
"to be past participle".

Finally, the method fails if there are no local de- 
pendencies that could be used for categorization and 
only non-local dependencies are informative. For ex- 
ample, the adverb in "Mc*N. Hester, CURRENTLY 
Dean of ... " and the conjunction in "to add that, IF 
United States policies ..." have similar immediate 
neighbors (comma, NP). The decision to consider 
only immediate neighbors is responsible for this type 
of error since taking a wider context into account 
would disambiguate the parts of speech in question. 

5 Future Work 

There are three avenues of future research we are 
interested in pursuing. First, we are planning to ap- 
ply the algorithm to an as yet untagged language. 
Languages with a rich morphology may be more dif- 
ficult than English since with fewer tokens per type, 
there is less data on which to base a categorization 
decision. 

Secondly, the error analysis suggests that consid- 
ering non-local dependencies would improve results. 



Categories that can be induced well (those characterized by local
dependencies) could be input into procedures that learn phrase
structure (e.g. (Brill and Marcus, 1992b; Finch, 1993)). These
phrase constraints could then be incorporated into the
distributional tagger to characterize non-local dependencies.

Finally, our procedure induces a "hard" part-of- 
speech classification of occurrences in context, i.e., 
each occurrence is assigned to only one category. It 
is by no means generally accepted that such a classification is
linguistically adequate. There is both synchronic (Ross, 1972) and
diachronic (Tabor, 1994)



3 Because of phrases like "I had sweet potatoes", forms
of "have" cannot serve as a reliable discriminator either.



evidence suggesting that words and their uses can 
inherit properties from several prototypical syntactic 
categories. For example, "fun" in "It's a fun thing to 
do." has properties of both a noun and an adjective 
(superlative "funnest" possible). We are planning 
to explore "soft" classification algorithms that can 
account for these phenomena. 

6 Conclusion 

In this paper, we have attempted to construct an 
algorithm for fully automatic distributional tagging, 
using unannotated corpora as the sole source of in- 
formation. The main innovation is that the algo- 
rithm is able to deal with part-of-speech ambiguity, 
a pervasive phenomenon in natural language that 
was unaccounted for in previous work on learning 
categories from corpora. The method was system- 
atically evaluated on the Brown corpus. Even if no 
automatic procedure can rival the accuracy of hu- 
man tagging, we hope that the algorithm will facili- 
tate the initial tagging of texts in new languages and 
sublanguages. 